
In this section, we will use a Pipeline to score different classifier models, such as K-Nearest Neighbours and Multinomial Naive Bayes, before settling on a final production model.
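As a minimal sketch of the approach (the toy posts and variable names below are illustrative, not the project's actual code):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy posts standing in for the real subreddit data (hypothetical)
docs = [
    "feeling down and alone again",
    "lost my job and feel empty",
    "cannot go on anymore",
    "no reason to keep going",
]
labels = [0, 0, 1, 1]  # 0 = r/depression, 1 = r/SuicideWatch

# A Pipeline bundles the vectorizer and classifier so both fit in one call
pipe = Pipeline([
    ("cv", CountVectorizer()),
    ("nb", MultinomialNB()),
])
pipe.fit(docs, labels)
preds = pipe.predict(docs)
print(preds)
```

Swapping the `("nb", MultinomialNB())` step for another estimator is all it takes to score a different model on the same features.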
| | title | selftext | author | num_comments | is_suicide | url | selftext_clean | title_clean | author_clean | selftext_length | title_length | megatext_clean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Our most-broken and least-understood rules is ... | We understand that most people who reply immed... | SQLwitch | 133 | 0 | https://www.reddit.com/r/depression/comments/d... | understand people reply immediately op invitat... | broken least understood rule helper may invite... | sql witch | 4792 | 144 | sql witch understand people reply immediately ... |
| 1 | Regular Check-In Post | Welcome to /r/depression's check-in post - a p... | circinia | 1644 | 0 | https://www.reddit.com/r/depression/comments/e... | welcome r depression check post place take mom... | regular check post | c irc | 650 | 21 | c irc welcome r depression check post place ta... |
| 2 | I hate it so much when you try and express you... | I've been feeling really depressed and lonely ... | TheNewKiller69 | 8 | 0 | https://www.reddit.com/r/depression/comments/f... | feeling really depressed lonely lately job ful... | hate much try express feeling parent turn arou... | new killer 69 | 1866 | 137 | new killer 69 feeling really depressed lonely ... |
`<class 'pandas.core.frame.DataFrame'>` RangeIndex: 1897 entries, 0 to 1896. Data columns (total 12):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | title | 1897 non-null | object |
| 1 | selftext | 1897 non-null | object |
| 2 | author | 1897 non-null | object |
| 3 | num_comments | 1897 non-null | int64 |
| 4 | is_suicide | 1897 non-null | int64 |
| 5 | url | 1897 non-null | object |
| 6 | selftext_clean | 1897 non-null | object |
| 7 | title_clean | 1897 non-null | object |
| 8 | author_clean | 1897 non-null | object |
| 9 | selftext_length | 1897 non-null | int64 |
| 10 | title_length | 1897 non-null | int64 |
| 11 | megatext_clean | 1897 non-null | object |

dtypes: int64(4), object(8); memory usage: 178.0+ KB
We will first calculate a baseline score for our models to out-perform. In the context of our project, the baseline is the accuracy we would achieve by predicting that every reddit post comes from the r/SuicideWatch subreddit.
0.5166051660516605
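That baseline is simply the majority-class share of the label column. A sketch, assuming the label column is named `is_suicide` as in the dataframe above (the mini-frame here is hypothetical):

```python
import pandas as pd

# Hypothetical mini-frame; the real dataset has 1897 rows
df = pd.DataFrame({"is_suicide": [1, 1, 0, 1, 0]})

# Baseline accuracy = share of the class we always predict (r/SuicideWatch = 1)
baseline = df["is_suicide"].value_counts(normalize=True).max()
print(baseline)  # 0.6 on this toy frame; 0.5166... on the full dataset
```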
Before moving on to a production model, we will run a Count Vectorizer + Naive Bayes model on different columns and score them. This will help us pick the column we will use to build further models on.
In the context of our project, these are what the parameters in our confusion matrix represent:
True Positives (TP) - We predict that an entry is from the r/SuicideWatch subreddit and we are right. As we are seeking to identify suicide cases, our priority is to capture as many of these as possible.
True Negatives (TN) - We predict that an entry is from the r/depression subreddit and we are right. This also means we did well.
False Positives (FP) - We predict that an entry is from the r/SuicideWatch subreddit but we are wrong. Needless to say, this is undesirable.
False Negatives (FN) - We predict that an entry is from the r/depression subreddit but the entry is actually from r/SuicideWatch. This is the worst outcome, as it means we might miss out on helping someone who might be thinking about ending their life.
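In scikit-learn, these four counts can be pulled from `confusion_matrix` (a sketch with hypothetical labels, where 1 marks r/SuicideWatch):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0]  # hypothetical ground truth
y_pred = [1, 0, 0, 1, 1, 0]  # hypothetical predictions

# With binary labels sorted [0, 1], ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # 2 2 1 1
```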
| | series used (X) | model | AUC Score | precision | recall (sensitivity) | confusion matrix | train accuracy | test accuracy | baseline accuracy | specificity | f1-score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | selftext | CountVec + MultinomialNB | 0.69 | 0.66 | 0.66 | {'TP': 161, 'FP': 78, 'TN': 152, 'FN': 84} | 0.92 | 0.66 | 0.52 | 0.66 | 0.66 |
| 1 | author | CountVec + MultinomialNB | 0.57 | 0.63 | 0.55 | {'TP': 235, 'FP': 204, 'TN': 26, 'FN': 10} | 0.99 | 0.55 | 0.52 | 0.11 | 0.45 |
| 2 | title | CountVec + MultinomialNB | 0.67 | 0.62 | 0.62 | {'TP': 167, 'FP': 104, 'TN': 126, 'FN': 78} | 0.85 | 0.62 | 0.52 | 0.55 | 0.62 |
| 3 | selftext_clean | CountVec + MultinomialNB | 0.69 | 0.67 | 0.67 | {'TP': 165, 'FP': 78, 'TN': 152, 'FN': 80} | 0.91 | 0.67 | 0.52 | 0.66 | 0.67 |
| 4 | author_clean | CountVec + MultinomialNB | 0.54 | 0.51 | 0.51 | {'TP': 169, 'FP': 155, 'TN': 75, 'FN': 76} | 0.95 | 0.51 | 0.52 | 0.33 | 0.50 |
| 5 | title_clean | CountVec + MultinomialNB | 0.67 | 0.63 | 0.62 | {'TP': 178, 'FP': 112, 'TN': 118, 'FN': 67} | 0.84 | 0.62 | 0.52 | 0.51 | 0.62 |
| 6 | megatext_clean | CountVec + MultinomialNB | 0.71 | 0.67 | 0.67 | {'TP': 160, 'FP': 71, 'TN': 159, 'FN': 85} | 0.95 | 0.67 | 0.52 | 0.69 | 0.67 |
Based on a combination of scores from our modelling exercise above, we will proceed with megatext_clean -- a combination of our cleaned titles, usernames and posts -- as the column we will draw features from. Some reasons why:
Generalising Well - The model using megatext_clean scored 0.67 on the test set (the joint highest) against 0.95 on the training set.
High ROC Area Under Curve score - As our classes are largely balanced, AUC is a suitable metric for measuring the quality of our model's predictions. Our top choice performs best here.
Best recall/sensitivity score - This score measures the ratio of entries our model correctly labels as positive (in r/SuicideWatch) to all entries that are truly in r/SuicideWatch. As that is the target of our project, strong performance on this metric is important (and perhaps most important) to us.
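The scores in the table above derive directly from the confusion-matrix counts. A sketch of the formulas, using hypothetical counts purely for illustration:

```python
def metrics_from_cm(tp, fp, tn, fn):
    """Derive the reported scores from raw confusion-matrix counts."""
    recall = tp / (tp + fn)       # sensitivity: share of r/SuicideWatch posts caught
    specificity = tn / (tn + fp)  # share of r/depression posts correctly kept
    precision = tp / (tp + fp)    # share of positive predictions that were right
    f1 = 2 * precision * recall / (precision + recall)
    return recall, specificity, precision, f1

# Hypothetical counts, not taken from the tables above
r, s, p, f1 = metrics_from_cm(tp=80, fp=20, tn=70, fn=30)
print(round(r, 2), round(s, 2), round(p, 2), round(f1, 2))  # 0.73 0.78 0.8 0.76
```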
Inspired by our earlier function, we will create a similar function that runs multiple permutations of models with Count, Hashing and TF-IDF Vectorizers. The resulting metrics will be held neatly in a dataframe.
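A condensed sketch of such a function (the toy corpus and column names are illustrative; note that `HashingVectorizer` needs `alternate_sign=False` to produce the non-negative counts Multinomial Naive Bayes requires):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the megatext_clean column
docs = ["hypothetical post one", "another sad post",
        "cannot cope any longer", "everything feels pointless"] * 5
labels = [0, 0, 1, 1] * 5

rows = []
for vec_name, vec in [("cvec", CountVectorizer()),
                      ("hvec", HashingVectorizer(alternate_sign=False)),
                      ("tvec", TfidfVectorizer())]:
    pipe = Pipeline([("vec", vec), ("nb", MultinomialNB())])
    score = cross_val_score(pipe, docs, labels, cv=3).mean()
    rows.append({"model": f"{vec_name} + multi_nb", "cv score": score})

results = pd.DataFrame(rows)  # one tidy row per vectorizer/model combination
print(results)
```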
| | model | AUC Score | precision | recall (sensitivity) | best params | best score | confusion matrix | train accuracy | test accuracy | baseline accuracy | specificity | f1-score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cvec+ multi_nb | 0.72 | 0.67 | 0.67 | {'cv__max_df': 0.3, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'} | 0.65 | {'TP': 160, 'FP': 71, 'TN': 159, 'FN': 85} | 0.68 | 0.67 | 0.52 | 0.69 | 0.67 |
| 1 | cvec + ss + knn | 0.60 | 0.58 | 0.58 | {'cv__max_df': 0.2, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'} | 0.60 | {'TP': 144, 'FP': 100, 'TN': 130, 'FN': 101} | 0.72 | 0.58 | 0.52 | 0.57 | 0.58 |
| 2 | cvec + ss + logreg | 0.73 | 0.69 | 0.69 | {'cv__max_df': 0.3, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'} | 0.65 | {'TP': 173, 'FP': 75, 'TN': 155, 'FN': 72} | 0.69 | 0.69 | 0.52 | 0.67 | 0.69 |
| 3 | tvec + multi_nb | 0.73 | 0.68 | 0.68 | {'tv__max_df': 0.3, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 2), 'tv__stop_words': 'english'} | 0.65 | {'TP': 169, 'FP': 77, 'TN': 153, 'FN': 76} | 0.68 | 0.68 | 0.52 | 0.67 | 0.68 |
| 4 | tvec + ss + knn | 0.56 | 0.54 | 0.54 | {'tv__max_df': 0.2, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'} | 0.60 | {'TP': 146, 'FP': 118, 'TN': 112, 'FN': 99} | 0.74 | 0.54 | 0.52 | 0.49 | 0.54 |
| 5 | tvec + ss + logreg | 0.73 | 0.67 | 0.67 | {'tv__max_df': 0.3, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 2), 'tv__stop_words': 'english'} | 0.65 | {'TP': 160, 'FP': 72, 'TN': 158, 'FN': 85} | 0.68 | 0.67 | 0.52 | 0.69 | 0.67 |
| 6 | hvec + multi_nb | 0.77 | 0.69 | 0.68 | {'hv__ngram_range': (1, 1), 'hv__stop_words': 'english'} | 0.72 | {'TP': 148, 'FP': 54, 'TN': 176, 'FN': 97} | 0.89 | 0.68 | 0.52 | 0.77 | 0.68 |
| 7 | hvec + ss + knn | 0.51 | 0.75 | 0.52 | {'hv__ngram_range': (1, 1), 'hv__stop_words': 'english'} | 0.52 | {'TP': 245, 'FP': 229, 'TN': 1, 'FN': 0} | 0.52 | 0.52 | 0.52 | 0.00 | 0.36 |
| 8 | hvec + ss + logreg | 0.65 | 0.62 | 0.62 | {'hv__ngram_range': (1, 1), 'hv__stop_words': 'english'} | 0.63 | {'TP': 155, 'FP': 90, 'TN': 140, 'FN': 90} | 1.00 | 0.62 | 0.52 | 0.61 | 0.62 |
The Hashing Vectorizer + Multinomial Naive Bayes model out-performed the other models on multiple metrics, especially our much-prized AUC score (0.77) and the recall score (which measures our model's ability to identify True Positives). Another notable performer is the TF-IDF Vectorizer + Multinomial Naive Bayes combination: apart from the joint-second-highest AUC score of 0.73, its consistent performance on both the test and training sets showed that the model generalises well.
Next Step: Tuning Hyperparameters - We'll now move on to tweak the hyperparameters of both of these models.
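This kind of search can be sketched with `GridSearchCV`, where the `tv__` prefix routes each parameter to the pipeline's vectorizer step (toy corpus and grid values are illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in for the real training text
docs = ["hypothetical post one", "another sad post",
        "cannot cope any longer", "everything feels pointless"] * 5
labels = [0, 0, 1, 1] * 5

pipe = Pipeline([("tv", TfidfVectorizer()), ("nb", MultinomialNB())])

# step-name prefix "tv__" targets the TfidfVectorizer inside the pipeline
params = {
    "tv__max_df": [0.4, 0.5],
    "tv__max_features": [50, 70],
    "tv__ngram_range": [(1, 2), (1, 3)],
}
gs = GridSearchCV(pipe, params, cv=3)
gs.fit(docs, labels)
print(gs.best_params_)
```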
| | model | AUC Score | precision | recall (sensitivity) | best params | best score | confusion matrix | train accuracy | test accuracy | baseline accuracy | specificity | f1-score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cvec+ multi_nb | 0.72 | 0.67 | 0.67 | {'cv__max_df': 0.3, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'} | 0.65 | {'TP': 160, 'FP': 71, 'TN': 159, 'FN': 85} | 0.68 | 0.67 | 0.52 | 0.69 | 0.67 |
| 1 | cvec + ss + knn | 0.60 | 0.58 | 0.58 | {'cv__max_df': 0.2, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'} | 0.60 | {'TP': 144, 'FP': 100, 'TN': 130, 'FN': 101} | 0.72 | 0.58 | 0.52 | 0.57 | 0.58 |
| 2 | cvec + ss + logreg | 0.73 | 0.69 | 0.69 | {'cv__max_df': 0.3, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'} | 0.65 | {'TP': 173, 'FP': 75, 'TN': 155, 'FN': 72} | 0.69 | 0.69 | 0.52 | 0.67 | 0.69 |
| 3 | tvec + multi_nb | 0.73 | 0.68 | 0.68 | {'tv__max_df': 0.3, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 2), 'tv__stop_words': 'english'} | 0.65 | {'TP': 169, 'FP': 77, 'TN': 153, 'FN': 76} | 0.68 | 0.68 | 0.52 | 0.67 | 0.68 |
| 4 | tvec + ss + knn | 0.56 | 0.54 | 0.54 | {'tv__max_df': 0.2, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'} | 0.60 | {'TP': 146, 'FP': 118, 'TN': 112, 'FN': 99} | 0.74 | 0.54 | 0.52 | 0.49 | 0.54 |
| 5 | tvec + ss + logreg | 0.73 | 0.67 | 0.67 | {'tv__max_df': 0.3, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 2), 'tv__stop_words': 'english'} | 0.65 | {'TP': 160, 'FP': 72, 'TN': 158, 'FN': 85} | 0.68 | 0.67 | 0.52 | 0.69 | 0.67 |
| 6 | hvec + multi_nb | 0.77 | 0.69 | 0.68 | {'hv__ngram_range': (1, 1), 'hv__stop_words': 'english'} | 0.72 | {'TP': 148, 'FP': 54, 'TN': 176, 'FN': 97} | 0.89 | 0.68 | 0.52 | 0.77 | 0.68 |
| 7 | hvec + ss + knn | 0.51 | 0.75 | 0.52 | {'hv__ngram_range': (1, 1), 'hv__stop_words': 'english'} | 0.52 | {'TP': 245, 'FP': 229, 'TN': 1, 'FN': 0} | 0.52 | 0.52 | 0.52 | 0.00 | 0.36 |
| 8 | hvec + ss + logreg | 0.65 | 0.62 | 0.62 | {'hv__ngram_range': (1, 1), 'hv__stop_words': 'english'} | 0.63 | {'TP': 155, 'FP': 90, 'TN': 140, 'FN': 90} | 1.00 | 0.62 | 0.52 | 0.61 | 0.62 |
| 9 | hvec + multi_nb(tuning) | 0.75 | 0.68 | 0.68 | {'hv__n_features': 1000, 'hv__ngram_range': (1, 1), 'hv__stop_words': 'english'} | 0.69 | {'TP': 170, 'FP': 75, 'TN': 155, 'FN': 75} | 0.82 | 0.68 | 0.52 | 0.67 | 0.68 |
| 10 | tvec + multi_nb(tuning) | 0.75 | 0.69 | 0.69 | {'tv__max_df': 0.4, 'tv__max_features': 70, 'tv__min_df': 2, 'tv__ngram_range': (1, 2), 'tv__stop_words': 'english'} | 0.68 | {'TP': 172, 'FP': 76, 'TN': 154, 'FN': 73} | 0.71 | 0.69 | 0.52 | 0.67 | 0.69 |
| 11 | hvec + multi_nb (tuning_2) | 0.76 | 0.68 | 0.68 | {'hv__n_features': 2000, 'hv__ngram_range': (1, 1), 'hv__stop_words': 'english'} | 0.69 | {'TP': 167, 'FP': 74, 'TN': 156, 'FN': 78} | 0.84 | 0.68 | 0.52 | 0.68 | 0.68 |
| 12 | tvec + multi_nb (tuning_2) | 0.75 | 0.70 | 0.70 | {'tv__max_df': 0.5, 'tv__max_features': 70, 'tv__min_df': 2, 'tv__ngram_range': (1, 3), 'tv__stop_words': 'english'} | 0.68 | {'TP': 173, 'FP': 71, 'TN': 159, 'FN': 72} | 0.71 | 0.70 | 0.52 | 0.69 | 0.70 |
The models responded well to the tuning sessions. Although the Hashing model had a slightly better AUC score, I'd prioritise the tuned TF-IDF model's high recall score, as it will help predict potential suicide cases (True Positives) more accurately. This model also generalises well, with only a 0.01 gap between its training and test set scores.
Our production model is a combination of two models: TF-IDF and Multinomial Naive Bayes.
The first, a TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer, assigns scores to the words (in our case, the top 70 words) in our selected feature. TF-IDF penalises a word that appears too often across documents.
The resulting matrix of "word scores" is then passed into a Multinomial Naive Bayes classifier, which makes predictions by calculating the probability of a given word falling into a certain category.
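Assembled as a pipeline with the tuned parameters from the table above, the production model might look like this (the toy fit/predict data is hypothetical; the real input is the megatext_clean column):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tuned parameters taken from the best-params column above
production = Pipeline([
    ("tv", TfidfVectorizer(max_df=0.5, max_features=70, min_df=2,
                           ngram_range=(1, 3), stop_words="english")),
    ("nb", MultinomialNB()),
])

# Hypothetical toy data to show the fit/predict interface
docs = ["feeling hopeless and tired", "want to end it all",
        "sad about work again", "no point in living anymore"] * 10
labels = [0, 1, 0, 1] * 10
production.fit(docs, labels)
print(production.predict(["life has no point anymore"]))
```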
ACCURACY: 0.6989473684210527 AUC SCORE: 0.7548003549245785
Results - The optimised model scored well on our test set, achieving an AUC score of 0.75. We will proceed to understand our model a bit better before making final critiques and recommendations.
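The two reported numbers correspond to scikit-learn's `accuracy_score` and `roc_auc_score`; a sketch with hypothetical predictions (note AUC is computed from predicted probabilities, not hard labels):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_test = [1, 0, 1, 1, 0, 0]               # hypothetical true labels
y_pred = [1, 0, 1, 0, 0, 1]               # hypothetical hard predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6]   # hypothetical P(class 1)

print(accuracy_score(y_test, y_pred))  # share of correct predictions: 4/6
print(roc_auc_score(y_test, y_prob))   # how well probabilities rank positives first
```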